llama : refactor llama_kv_cache, llama_context and llm_build_context #11213
base: master
Conversation
I am thinking about the following API change for this PR:

```cpp
// API on `master`
DEPRECATED(LLAMA_API void      llama_kv_cache_clear      (ctx));
DEPRECATED(LLAMA_API bool      llama_kv_cache_seq_rm     (ctx));
DEPRECATED(LLAMA_API void      llama_kv_cache_seq_cp     (ctx));
DEPRECATED(LLAMA_API void      llama_kv_cache_seq_keep   (ctx));
DEPRECATED(LLAMA_API void      llama_kv_cache_seq_add    (ctx));
DEPRECATED(LLAMA_API void      llama_kv_cache_seq_div    (ctx));
DEPRECATED(LLAMA_API llama_pos llama_kv_cache_seq_pos_max(ctx));
DEPRECATED(LLAMA_API void      llama_kv_cache_defrag     (ctx));
DEPRECATED(LLAMA_API bool      llama_kv_cache_can_shift  (ctx));
DEPRECATED(LLAMA_API void      llama_kv_cache_update     (ctx));

// works with `ctx.kv_self` - backwards compatible with `master`
LLAMA_API void      llama_kv_self_clear      (ctx);
LLAMA_API bool      llama_kv_self_seq_rm     (ctx);
LLAMA_API void      llama_kv_self_seq_cp     (ctx);
LLAMA_API void      llama_kv_self_seq_keep   (ctx);
LLAMA_API void      llama_kv_self_seq_add    (ctx);
LLAMA_API void      llama_kv_self_seq_div    (ctx);
LLAMA_API llama_pos llama_kv_self_seq_pos_max(ctx);
LLAMA_API void      llama_kv_self_defrag     (ctx);
LLAMA_API bool      llama_kv_self_can_shift  (ctx);
LLAMA_API void      llama_kv_self_update     (ctx);

// TODO: llama_kv_cache API
//       can be implemented in a later PR

// new API to access the KV cache instance
struct llama_kv_cache;

LLAMA_API struct llama_kv_cache * llama_get_kv_self(ctx);
LLAMA_API void                    llama_set_kv_self(ctx, kv);

// allow to clone, free, save, load the kv cache
```
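For illustration, a minimal migration sketch, assuming the proposed `llama_kv_self_...` names keep the same argument lists as the existing `llama_kv_cache_...` functions (the proposal above abbreviates the signatures to `(ctx)`):

```cpp
#include "llama.h"

// sketch only: clear the cache for all sequences via the proposed API;
// the deprecated calls are shown in comments for comparison (arguments unchanged)
static void reset_cache(llama_context * ctx) {
    // before (deprecated):
    //   llama_kv_cache_clear (ctx);
    //   llama_kv_cache_seq_rm(ctx, /*seq_id=*/-1, /*p0=*/-1, /*p1=*/-1);

    // after:
    llama_kv_self_clear (ctx);
    llama_kv_self_seq_rm(ctx, /*seq_id=*/-1, /*p0=*/-1, /*p1=*/-1);
}
```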
src/llama.cpp (outdated)
```diff
 void llama_kv_self_update(llama_context * ctx) {
-    llama_kv_self_update_impl(*ctx);
+    const bool need_reserve = ctx->kv_self_update();
+
+    // reserve a worst case graph again
+    if (need_reserve) {
+        // TODO: extract to a function
+        const auto & cparams = ctx->cparams;
+        const auto & model   = ctx->model;
+
+        // build worst-case graph
+        uint32_t n_seqs   = 1; // TODO: worst-case number of sequences
+        uint32_t n_tokens = std::min(cparams.n_ctx, cparams.n_ubatch);
+
+        llama_token token = model.vocab.token_bos(); // not actually used by llama_build_graph, but required to choose between token and embedding inputs graph
+        llama_ubatch ubatch = { true, n_tokens, n_tokens / n_seqs, n_seqs, &token, nullptr, nullptr, nullptr, nullptr, nullptr};
+
+        ggml_cgraph * gf = llama_build_graph(*ctx, ubatch, true);
+
+        // initialize scheduler with the worst-case graph
+        ggml_backend_sched_reset(ctx->sched.get());
+        if (!ggml_backend_sched_reserve(ctx->sched.get(), gf)) {
+            LLAMA_LOG_ERROR("%s: failed to allocate compute buffers\n", __func__);
+        }
+    }
 }
```
@slaren If we have a separate scheduler for the `kv_self` updates (such as K-shift and defrag), would this worst-case reservation be necessary?
No, but that would increase memory usage.
```diff
@@ -460,8 +461,9 @@ extern "C" {

     DEPRECATED(LLAMA_API int32_t llama_n_vocab (const struct llama_vocab * vocab), "use llama_vocab_n_tokens instead");

-    LLAMA_API const struct llama_model * llama_get_model   (const struct llama_context * ctx);
+    LLAMA_API enum llama_pooling_type    llama_pooling_type(const struct llama_context * ctx);
+    LLAMA_API const struct llama_model * llama_get_model   (const struct llama_context * ctx); // TODO: remove const?
```
`llama_model` should always be immutable, otherwise it would be hard to guarantee thread-safety when it is used in multiple contexts. So returning a `const` pointer should be correct here.
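To illustrate the point, a minimal hypothetical sketch using the public `llama.h` API (the model path and the worker body are placeholders; error checking omitted): because the model handle exposed through the context is `const`, several contexts can share the same model across threads without mutating the weights.

```cpp
#include "llama.h"
#include <thread>

static void worker(llama_context * ctx) {
    const llama_model * model = llama_get_model(ctx); // read-only view of the shared model
    (void) model; // ... build batches and call llama_decode(ctx, ...) here ...
}

int main() {
    llama_backend_init();

    llama_model * model = llama_model_load_from_file("model.gguf", llama_model_default_params());

    // two independent contexts sharing the same immutable model
    llama_context * ctx_a = llama_init_from_model(model, llama_context_default_params());
    llama_context * ctx_b = llama_init_from_model(model, llama_context_default_params());

    std::thread ta(worker, ctx_a);
    std::thread tb(worker, ctx_b);
    ta.join();
    tb.join();

    llama_free(ctx_a);
    llama_free(ctx_b);
    llama_model_free(model);

    llama_backend_free();
}
```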
@slaren Coming back to your comment from earlier: #11110 (review)

In the OP I have outlined a possible approach to make the implementation more abstract. I have focused primarily on the abstraction of the KV cache and the llama context. If I understand correctly your suggestion, the idea is to have the compute graph build functions for each of the arches (e.g. …).

I haven't really thought enough about this to make specific suggestions, but I think the goal should be to have an interface that can be used to define everything necessary to implement a model architecture. Ideally, to add support for a new architecture, it should only be necessary to define a new class and create a mapping between the architecture name in the GGUF file and this class. There may of course be more classes in the interface, but there should be a single entry point. So this should include more than just the graph build function; it should also include all the functions to load a model, create a context, and everything else that may be necessary to run a model. This interface would also need to be supported by other interfaces, such as the KV cache abstraction and the graph building helper functions that are currently in ….

To do this, I think it would be better to create an abstract interface that contains everything necessary to define a model architecture. I think that's likely to result in a cleaner and more maintainable codebase than using ….

This is of course a very high level suggestion, it will take a lot of work to define all the details.
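As a rough illustration of this suggestion - all names below are hypothetical, not from the PR - the single entry point could be an abstract class plus a registry keyed by the GGUF architecture name:

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// hypothetical interface: everything needed to support one model architecture
struct llm_arch_i {
    virtual ~llm_arch_i() = default;

    virtual void load_hparams()   = 0; // read architecture-specific hyperparameters
    virtual void load_tensors()   = 0; // create/load the model tensors
    virtual void build_graph()    = 0; // build the compute graph for a batch
    virtual void create_context() = 0; // create a context suitable for this architecture
};

// registry mapping the architecture name from the GGUF file to a factory
using llm_arch_factory = std::function<std::unique_ptr<llm_arch_i>()>;

static std::map<std::string, llm_arch_factory> & llm_arch_registry() {
    static std::map<std::string, llm_arch_factory> reg;
    return reg;
}

// adding a new architecture would then only require defining a class and registering it:
//   llm_arch_registry()["my_arch"] = [] { return std::make_unique<llm_arch_my_arch>(); };
```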
Thanks for the suggestions. I'll aim to create the abstract model interface and restructure the implementation so that the …
This PR is getting close to completion. Here is an update of the new software architecture (see the Overview section below).

Pinging @MollySophia and @compilade if you could run some tests with this branch to check if the RWKV and Mamba models work correctly. Any suggestions for improving the code are welcome. Hoping to have this ready for review in the next few days.
I've been quite on/off recently, but hopefully I can have a deeper look into this during the weekend.
@ggerganov I see that there is an implicit assumption in …
Looks good overall. Some points I'm thinking about for my vision PR (a rough sketch follows below):

- Having a derived class `llama_vision_context : llama_context` as you said
- Input image tokens will be obtained via `llama_batch_ext`; they will be passed to `llama_vision_context::input_set`, which can work with pixel values instead of text tokens
- Output tensor will be saved to `llama_context::embd_tensor` ==> need to add this to the base class
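A rough sketch of the points above; `llama_vision_context`, `llama_batch_ext`, `input_set` and `embd_tensor` are the names from this comment, the rest is illustrative:

```cpp
struct ggml_tensor;
struct llama_batch_ext;                        // future batch type that can carry image data
struct llama_context { /* base class from this PR */ };

struct llama_vision_context : llama_context {
    // output of the vision encoder; per the comment, this member would be added to the base class
    ggml_tensor * embd_tensor = nullptr;

    // consumes pixel values from the batch instead of text tokens
    void input_set(const llama_batch_ext & batch);
};
```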
```cpp
        case LLM_ARCH_BERT:
        case LLM_ARCH_JINA_BERT_V2:
        case LLM_ARCH_NOMIC_BERT:
            ctx = new llama_context_enc(*model, params, LLAMA_GRAPH_TYPE_DEFAULT);
```
Should this be `LLAMA_GRAPH_TYPE_ENCODER`? (Though I know it's not currently in use.)
Yes, likely will change to `LLAMA_GRAPH_TYPE_ENCODER` to be more explicit, though the idea is to have a "default" graph for each model, which in this case makes no difference between using "default" and "encoder".
```cpp
    cparams.offload_kqv  = params.offload_kqv;
    cparams.flash_attn   = params.flash_attn;
    cparams.no_perf      = params.no_perf;
    cparams.pooling_type = params.pooling_type;
```
I'm wondering if we can further split these into "environment" params and "inference" params. AFAIK YaRN/RoPE is not used in recurrent models (and vision encoders usually use learned embeddings).

For example:

```cpp
// "environment" params, meaning they may affect performance, but do not change the result
cparams.n_seq_max       = std::max(1u, params.n_seq_max);
cparams.n_threads       = params.n_threads;
cparams.n_threads_batch = params.n_threads_batch;
cparams.defrag_thold    = params.defrag_thold;
cparams.embeddings      = params.embeddings;
cparams.offload_kqv     = params.offload_kqv;
cparams.flash_attn      = params.flash_attn;
cparams.no_perf         = params.no_perf;

// "inference" params, may affect the result
cparams.yarn_ext_factor  = params.yarn_ext_factor;
cparams.yarn_attn_factor = params.yarn_attn_factor;
cparams.yarn_beta_fast   = params.yarn_beta_fast;
cparams.yarn_beta_slow   = params.yarn_beta_slow;
cparams.rope_freq_base   = params.rope_freq_base  == 0.0f ? hparams.rope_freq_base_train  : params.rope_freq_base;
cparams.rope_freq_scale  = params.rope_freq_scale == 0.0f ? hparams.rope_freq_scale_train : params.rope_freq_scale;

// everything else, not sure how to categorize them:
cparams.n_ctx   = params.n_ctx == 0 ? hparams.n_ctx_train : params.n_ctx;
cparams.n_batch = hparams.causal_attn ? std::min(cparams.n_ctx, params.n_batch) : params.n_batch;
```
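A minimal sketch of how such a split could look, using hypothetical struct names and the fields from the listing above:

```cpp
#include <cstdint>

// "environment" params: may affect performance, but must not change the results
struct llama_cparams_env {
    uint32_t n_seq_max;
    int32_t  n_threads;
    int32_t  n_threads_batch;
    float    defrag_thold;
    bool     embeddings;
    bool     offload_kqv;
    bool     flash_attn;
    bool     no_perf;
};

// "inference" params: may affect the computed results
struct llama_cparams_infer {
    float yarn_ext_factor;
    float yarn_attn_factor;
    float yarn_beta_fast;
    float yarn_beta_slow;
    float rope_freq_base;
    float rope_freq_scale;
};
```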
Also, since the model definition is now fully contained inside …

Currently, we can debug the cgraph using …

I'm not sure if the idea is worth exploring, but I can create a dedicated issue to discuss more if needed.
@fairydreaming Yes, you should add a new …
Overall yes. The details are not yet completely clear to me - I think once the T5 encoder-decoder use case is implemented we will have a clearer picture and a starting point for multi-modal support. What I am trying to do is to be able to compose the …

Extending this analogy, a vision model is likely to fit in the same …
@ggerganov I think there's still one thing missing. There should be an abstract kv cache interface, …
```cpp
    // if we have the output embeddings from the encoder, use them directly
    if (cross->t_embd) {
        inp.cross_embd = ggml_view_tensor(ctx0, cross->t_embd);

        return inp.cross_embd;
    }
```
With this change we now use the embeddings produced by the encoder (`cross->t_embd`) directly as input for the decoder's cross-attention, without downloading/uploading to/from RAM.

This seems to work correctly, but in debug it hits the following assert when I explicitly use `-dev none` on Mac:

```
./bin/llama-cli \
    -m ../models/google-t5-small/ggml-model-f16.gguf \
    -p 'Translate from English to German: The house is wonderful.' \
    -dev none
```

```
0.00.122.688 I llama_context_kv_self: constructing llama_context_kv_self
0.00.122.690 I init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 6, can_shift = 1
0.00.125.481 I init: CPU KV buffer size = 48.00 MiB
0.00.125.485 I llama_context_kv_self: KV self size = 48.00 MiB, K (f16): 24.00 MiB, V (f16): 24.00 MiB
0.00.138.987 I reserve: CPU compute buffer size = 30.00 MiB
0.00.138.988 I reserve: graph nodes = 197
0.00.138.988 I reserve: graph splits = 61 (with bs=512), 5 (with bs=1)
0.00.152.861 I reserve: CPU compute buffer size = 213.00 MiB
0.00.152.862 I reserve: graph nodes = 342
0.00.152.862 I reserve: graph splits = 98 (with bs=512), 18 (with bs=1)
0.00.152.869 I common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
0.00.152.869 W common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Assertion failed: (tensor_alloc->offset == SIZE_MAX), function ggml_gallocr_init_tensor, file ggml-alloc.c, line 793.
Abort trap: 6
```

I think it is related to re-using the tensor from the encoder context, but I am not sure if the assert is correct in this case. @slaren Any ideas?

Edit: btw, it does not hit the assert without `-dev none`, or with `-dev none -fa`.
I am not sure exactly what triggers the assert - probably because the graph didn't change except in that one tensor that previously was a view and now isn't, and ggml-alloc is not correctly detecting that the graph changed in an incompatible way. However, I don't think this is correct either way, because to do this you would need to allocate this tensor in a different buffer/sched. It's not possible to use tensors allocated in the compute buffer in the next graph, since the compute buffer is reset with each graph.
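A minimal sketch of that idea (the helper name and parameters are hypothetical): copy the encoder output into a tensor allocated in its own backend buffer, so it stays valid after the scheduler resets the compute buffer for the next graph.

```cpp
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// ctx_state must be a long-lived ggml context created with no_alloc = true;
// in real code the returned buffer would be stored and freed by the caller
static struct ggml_tensor * keep_cross_embd(
        struct ggml_context        * ctx_state,
        ggml_backend_buffer_type_t   buft,        // buffer type to allocate the copy in
        struct ggml_tensor         * t_embd_enc)  // encoder output (lives in the compute buffer)
{
    // persistent tensor with the same type/shape as the encoder output
    struct ggml_tensor * t_copy = ggml_dup_tensor(ctx_state, t_embd_enc);

    // allocate it in a dedicated backend buffer, not in the scheduler's compute buffer
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx_state, buft);
    (void) buf; // the caller should keep this buffer alive as long as t_copy is used

    // copy the data out before the compute buffer is reused by the next graph
    ggml_backend_tensor_copy(t_embd_enc, t_copy);

    return t_copy;
}
```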
Overview
This PR is an intermediate step towards a more generic implementation that will support different underlying implementations of `llama_kv_cache`, `llama_context` and the graph building logic (a.k.a. `llm_build_context`). The `llama_kv_cache` is also introduced in the public API as an object, but its actual functionality is yet to be defined in follow-up PRs.

Currently, no functional changes have been introduced. Mainly the code has been reorganized in a way to allow implementing new abstractions. The main changes in the implementation are:

- Avoid direct use of `llama_kv_cache` in `llm_build_context`. The goal is to be able to construct the computation graphs only through the abstract `llama_context` interface, which will hide the actual KV cache implementation and thus allow it to be overloaded based on the parameters of the specific use case. More generally, the `llama_context` hides not only the KV cache implementation, but all the internal state (such as applied adapters, masks, etc., if any) with the exception of the model weights - these are still available to the `llm_build_context` in order to be able to construct the backbone graph of the various architectures.
- Avoid direct use of `llama_kv_cache` in `llama_decode`/`llama_encode`. These are abstracted through a new object `llama_batch_manager`, which is produced by the current `llama_context`. Again the goal is to not make explicit assumptions about the underlying KV cache implementation while processing the batches, and to be able to delegate this logic to the `llama_context`. The `llama_batch_manager` is produced by the `llama_context` and will handle logic such as restoring the KV cache state to a consistent state upon errors, splitting the input batch into micro batches according to the internal processing logic, etc.
- Move the KV cache related logic into `llama_kv_cache`. In the future, these functions will be overloaded for the specific KV cache implementations through a common abstract interface.

The modifications so far are quite substantial and touch too many lines. Even though the code is in a very intermediate state, with many members still publicly exposed and without proper object-oriented implementation in place, it should still be mergeable.
The general class hierarchy that I have in mind is like this:
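(A rough sketch based only on the class and interface names used in this description; the exact shape of the hierarchy is still open:)

```cpp
// rough sketch, using the names mentioned in this description

struct llama_build_i;         // graph-building interface used by llm_build_context
struct llama_batch_manager_i; // batch-processing interface used by decode/encode

struct llama_kv_cache { /* abstract KV cache interface */ };
struct llama_kv_cache_unified  : llama_kv_cache {}; // the current implementation
struct llama_kv_cache_standard : llama_kv_cache {}; // future: multi-user scenarios
struct llama_kv_cache_mamba    : llama_kv_cache {}; // future: Mamba-style architectures

struct llama_context { /* base: ggml buffers, backends, adapters - no notion of a KV cache */ };
struct llama_context_unified  : llama_context {};   // uses llama_kv_cache_unified
struct llama_context_standard : llama_context {};   // uses llama_kv_cache_standard
struct llama_context_no_kv    : llama_context {};   // e.g. encoder-only models
```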
Here, `llama_kv_cache_unified` is basically the `llama_kv_cache` implementation that we currently have. In the future, we will add more implementations that would be appropriate for multi-user scenarios (e.g. `llama_kv_cache_standard`) or for Mamba architectures (`llama_kv_cache_mamba`).

The base `llama_context` class will implement common functionality such as low-level `ggml` buffer and backend management + adapters, without the notion of a KV cache. The derived classes will specialize the `llama_context` for different use-cases.

The `llm_build_context` would operate only through the `llama_build_i` interface, and the batch processing will respectively only interact with the `llama_batch_manager_i` interface. The type of `llama_context` to construct in functions such as `llama_init_from_model()` would be determined based on the model and the specified context parameters. For example, the user would be able to create both a `llama_context_unified` and a `llama_context_standard` for a `LLM_ARCH_QWEN2` model. Or a `llama_context_no_kv` for an encoding-only `LLM_ARCH_BERT` model. And so on.

API changes
The current changes are only necessary to make the API more consistent in following the naming convention. To migrate, simply replace the old API calls with the new ones:

- Deprecated: the `llama_kv_cache_...` API
- New: the `llama_kv_self_...` API

In the future, the `llama_kv_cache_...` API will be changed to work with `struct llama_kv_cache` instead of `struct llama_context`, and the functionality will be extended to support things like saving, copying, loading, etc.
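For illustration, a header-style sketch of what that future API could look like, based on the `llama_get_kv_self` proposal earlier in this thread (none of this is part of this PR):

```cpp
// sketch: the llama_kv_cache_... functions would take the cache object directly
struct llama_kv_cache;

// obtain the cache instance from a context (per the earlier proposal)
LLAMA_API struct llama_kv_cache * llama_get_kv_self(struct llama_context * ctx);

// operate on the cache object instead of the context
LLAMA_API void llama_kv_cache_clear (struct llama_kv_cache * kv);
LLAMA_API bool llama_kv_cache_seq_rm(struct llama_kv_cache * kv,
                                     llama_seq_id seq_id, llama_pos p0, llama_pos p1);

// plus clone/free/save/load, as mentioned in the proposal
```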
Notes

- `build_qwen2vl`, `inp_pos`, `lctx.n_pos_per_token` hack
- `n_outputs` and `n_outputs_enc` in `llm_build_context` seem incorrect
- `inp_s_seq` - not used
- `batch.pos == NULL` - `llama_context::pos_max()` is used incorrectly
- `llama_context` `encode()`/`decode()`
- `worst_case` from the `llama_graph_i` API?

PRs to resolve
New features